Conversation
Just realized I haven't updated the README.md - I will do this now
sam-grant
left a comment
This is really great. I'm excited to see it working and I'm glad you were able to reuse pieces of the existing code.
    # Create a sample file list for demonstration
    logger.log("\nCreating sample file list for demonstration...", "info")

    # Use the MDS3a.txt file list provided in the repository
Can we reuse get_file_list from pyprocess?
https://github.com/Mu2e/pyutils/blob/main/pyutils/pyprocess.py#L65-L138
Oh right, this is a demo
Sure, I actually realized that I hadn't got that part complete. I'm working on it now; it should be done in a few minutes.
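For reference, here is a minimal stand-in for the kind of file-list helper being discussed (the real `get_file_list` in `pyprocess.py` lives at the link above and its signature may well differ, e.g. it may resolve dataset definition names). This sketch just reads one file path per line from a plain-text list like MDS3a.txt:

```python
from pathlib import Path

def read_file_list(list_path):
    """Hypothetical stand-in for pyprocess.get_file_list.

    Reads a plain-text file list (one path per line), skipping
    blank lines and '#' comments. Illustration only; not the
    actual pyutils implementation.
    """
    lines = Path(list_path).read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]
```

Reusing a single helper like this for both the demo and `DaskProcessor` would keep the file-list handling consistent across the package.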
I want to make notebooks, but I couldn't get the notebook to see my test pyenv, so I had to settle for this. Once it's merged I will make some nicer interactive notebooks.
setup.py
Outdated
    setup(
        name="pyutils",
    -   version="1.4.0",
    +   version="1.8.0",
Ah, I guess we should actually bump to 1.9.0 or even 2.0.0 (major change), since we're on 1.8.0 with the current release.
    if sample_files and file_list_path:
        logger.log("Using DaskProcessor with multi-file processing", "info")

        data = processor.process_data(
Cool that we can use a similar interface
Yes, I know you wanted to keep things as is. I was also able to benchmark the two against one another - I'll show some stats on Wednesday.
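The per-file pattern behind this shared interface can be sketched without Dask at all. In this sketch, `concurrent.futures` stands in for the Dask client, and the names (`process_one_file`, `process_files`) are illustrative, not the actual pydask internals:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_one_file(path):
    # Illustrative per-file task; the real processor would open the
    # file (e.g. with uproot) and apply selection cuts here.
    if "bad" in path:
        raise IOError(f"cannot read {path}")
    return {"file": path, "n_events": len(path)}

def process_files(paths, max_workers=4):
    """Map a task over many files, continuing when one fails
    (the error resilience discussed above)."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one_file, p): p for p in paths}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as err:
                failures.append((futures[fut], err))
    return results, failures
```

With `dask.distributed`, the same shape becomes `client.map(process_one_file, paths)` followed by gathering the results, which is what lets a Dask-backed processor keep a `Processor`-style `process_data` entry point.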
    client: Optional[Client] = None
    created_client = False
    try:
        if scheduler_address:
Nice, so we just need to know the address and can connect to any scheduler on the network?
Yes, exactly. I think we need to wait for the EAF team on the centralized scheduler; for now we can work with a local scheduler/cluster. In that respect it doesn't really have much advantage over the current pyprocess.py, but I think that pydask will be more future-proof, and once we have the resources at the EAF it will help us a lot!
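The ownership logic in the quoted snippet (only tear down a client we created ourselves) can be sketched as follows. `DummyClient` is a stand-in for `dask.distributed.Client` so the sketch runs without a cluster; with real Dask, the remote branch would be `Client(scheduler_address)` and the local fallback `Client(LocalCluster())`:

```python
from typing import Optional

class DummyClient:
    """Stand-in for dask.distributed.Client (illustration only)."""
    def __init__(self, address=None):
        self.address = address or "local"
        self.closed = False
    def close(self):
        self.closed = True

def get_client(scheduler_address=None,
               existing: Optional[DummyClient] = None):
    """Return (client, created_client).

    The caller closes the client only if created_client is True,
    so a shared client passed in from outside is never torn down.
    """
    if existing is not None:
        return existing, False      # caller owns it; don't close later
    if scheduler_address:
        # Connect to any reachable scheduler on the network by address
        return DummyClient(scheduler_address), True
    # Fall back to a local scheduler/cluster
    return DummyClient(), True
```

Callers then pair this with `try`/`finally`: submit work, and in `finally` call `client.close()` only when `created_client` is true.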
I have implemented your suggestions
This pull request introduces Dask-based parallel file processing to the pyutils analysis framework, enabling scalable multi-file analysis and improved performance. The main changes are the addition of a new `DaskProcessor` class for parallel data processing, a comprehensive example script demonstrating its usage, and a version bump to reflect these new capabilities.

**Dask integration and parallel processing:**

Added the `pyutils/pydask.py` module providing the `DaskProcessor` class, which mirrors the API of `Processor` but uses Dask for parallel file processing. This allows users to process multiple files concurrently, either locally or on a distributed Dask cluster, with progress tracking and error resilience. ([pyutils/pydask.py R1-R172](https://github.com/Mu2e/pyutils/pull/67/files#diff-b250e6c6661378cbc729a2da04b46f2d294e70508e4a275b7e0fd9cbcae9a15fR1-R172))

**Documentation and examples:**

Added the `examples/scripts/pyutils_basics_dask.py` script demonstrating how to use `DaskProcessor` for multi-file analysis, including selection cuts, data inspection, vector operations, and plotting. The script highlights the advantages of Dask-based processing and provides a step-by-step guide for users. ([examples/scripts/pyutils_basics_dask.py R1-R300](https://github.com/Mu2e/pyutils/pull/67/files#diff-d8098ac33b1267668f9bff146f303a655c199c355841082bd7130c69ee4a3131R1-R300))

**Version update:**

Updated `setup.py` to bump the package version from 1.4.0 to 1.8.0, reflecting the addition of Dask support and new features. ([setup.py L5-R5](https://github.com/Mu2e/pyutils/pull/67/files#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7L5-R5))